Word frequency
Definition
Word frequency refers to the number of times a word occurs in a corpus of texts. Less frequent words in a reference corpus (e.g., edifice, cuisine, egregious) are considered more sophisticated than frequently occurring words (e.g., building, food, bad).
Methodology
Word frequency indices are calculated as the mean frequency score of words in a given text, using various reference corpora. Frequencies can be represented in several forms:
- Raw frequency: The sum of the reference-corpus frequency counts for the words in a text, divided by the number of words in the text.
- Logarithmic frequency: A log transformation is applied to raw frequency scores to reduce the influence of extremely frequent words and normalize the distribution.
- Lemma frequency: Frequencies are computed based on the base (dictionary) form of words, rather than surface forms (e.g., run, runs, and ran are counted as run).
- Lemma type frequency: Frequency values are calculated based on unique lemma types in the text, rather than tokens.
Each index can also be computed for the following word sets: all words (AW), content words (CW), and function words (FW).
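As a rough illustration of the index variants listed above, the Python sketch below computes the mean reference-corpus frequency of a text. The frequency tables, lemma map, and function-word list are invented toy values, not figures from any of the corpora named in this document, and the names `select_words` and `mean_frequency` are hypothetical. The sketch assumes that the log transformation is base 10 and is applied to each word's frequency before averaging, and that words missing from the reference table are skipped; the actual tool may handle either point differently.

```python
import math

# Toy reference data (invented for illustration; not real corpus values).
WORD_FREQ = {"the": 10000.0, "on": 1000.0, "cats": 10.0, "sat": 100.0, "mat": 1.0}
LEMMA_FREQ = {"the": 10000.0, "on": 1000.0, "cat": 100.0, "sit": 1000.0, "mat": 10.0}
LEMMAS = {"cats": "cat", "sat": "sit"}   # surface form -> lemma
FUNCTION_WORDS = {"the", "on"}           # everything else counts as a content word


def select_words(tokens, scope="AW"):
    """Keep all words (AW), content words (CW), or function words (FW)."""
    if scope == "CW":
        return [t for t in tokens if t not in FUNCTION_WORDS]
    if scope == "FW":
        return [t for t in tokens if t in FUNCTION_WORDS]
    return list(tokens)


def mean_frequency(tokens, scope="AW", log=False, lemma=False, types=False):
    """Mean reference frequency of the selected words, or None if none are found."""
    words = select_words(tokens, scope)
    table = WORD_FREQ
    if lemma:
        # Map each surface form to its lemma and look it up in the lemma table.
        words = [LEMMAS.get(w, w) for w in words]
        table = LEMMA_FREQ
    if types:
        # Count each unique form once instead of once per token.
        words = set(words)
    # Assumption: words absent from the reference table are skipped.
    scores = [table[w] for w in words if w in table]
    if log:
        # Assumption: base-10 log of each word's frequency, then averaged.
        scores = [math.log10(s) for s in scores]
    return sum(scores) / len(scores) if scores else None


tokens = ["the", "cats", "sat", "on", "the", "mat"]
print(mean_frequency(tokens))                           # raw frequency, all words
print(mean_frequency(tokens, scope="CW", log=True))     # log frequency, content words
print(mean_frequency(tokens, lemma=True, types=True))   # lemma type frequency
```

Because the function words in the toy table have much higher frequencies than the content words, the all-words mean is dominated by them; this is the effect the log transformation and the CW/FW split are meant to control.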
Corpora used
- BNC
- KF
- COCA
- HAL
- Brown
- TL
- SUBTLEXus
- TOEFL11
Calculated indices
- In the index names below, replace [ ] with a register (e.g., academic, fiction, spoken, written).
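The placeholder substitution can be sketched in Python as follows. The helper `expand_index` is hypothetical, and the register labels used here ("Spoken", "Academic") are placeholders: the exact spelling and capitalization of register names in the real index names, and which registers are available for each corpus, should be checked against the tool's output.

```python
def expand_index(template: str, register: str) -> str:
    """Fill the "[ ]" placeholder in an index-name template with a register label."""
    if "[ ]" not in template:
        raise ValueError(f"no register placeholder in {template!r}")
    return template.replace("[ ]", register)


# Register labels are assumptions; confirm the exact spelling in the tool's output.
templates = ["BNC_[ ]_Freq_AW", "BNC_[ ]_Freq_CW", "BNC_[ ]_Freq_FW"]
names = [expand_index(t, "Spoken") for t in templates]
print(names)
```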
BNC
The British National Corpus (BNC) is a 100-million-word collection of samples from a wide range of written and spoken British English from the late 20th century.
Raw frequency
- Indices:
- BNC_[ ]_Freq_AW
- BNC_[ ]_Freq_CW
- BNC_[ ]_Freq_FW
Logarithmic frequency
- Indices:
- BNC_[ ]_Freq_AW_Log
- BNC_[ ]_Freq_CW_Log
- BNC_[ ]_Freq_FW_Log
KF
The Kučera-Francis (KF) corpus is based on the Brown Corpus and provides frequency norms from American English texts published around 1961.
Raw frequency
- Indices:
- KF_Freq_AW
- KF_Freq_CW
- KF_Freq_FW
Logarithmic frequency
- Indices:
- KF_Freq_AW_Log
- KF_Freq_CW_Log
- KF_Freq_FW_Log
COCA
The Corpus of Contemporary American English (COCA) includes more than one billion words from spoken, fiction, magazine, newspaper, and academic texts, offering frequency data for a variety of registers.
Raw frequency
- Indices:
- COCA_[ ]_Frequency_AW
- COCA_[ ]_Frequency_CW
- COCA_[ ]_Frequency_FW
Logarithmic frequency
- Indices:
- COCA_[ ]_Frequency_Log_AW
- COCA_[ ]_Frequency_Log_CW
- COCA_[ ]_Frequency_Log_FW
Lemma frequency
- Indices:
- COCA_[ ]_lemma_frequency_AW
- COCA_[ ]_lemma_frequency_CW
- COCA_[ ]_lemma_frequency_FW
Logarithmic lemma frequency
- Indices:
- COCA_[ ]_lemma_frequency_Log_AW
- COCA_[ ]_lemma_frequency_Log_CW
- COCA_[ ]_lemma_frequency_Log_FW
Lemma type frequency
- Indices:
- COCA_[ ]_lemma_frequency_AW_TP
- COCA_[ ]_lemma_frequency_CW_TP
- COCA_[ ]_lemma_frequency_FW_TP
Logarithmic lemma type frequency
- Indices:
- COCA_[ ]_lemma_frequency_Log_AW_TP
- COCA_[ ]_lemma_frequency_Log_CW_TP
- COCA_[ ]_lemma_frequency_Log_FW_TP
HAL
Hyperspace Analogue to Language (HAL) is a large-scale co-occurrence-based corpus providing frequency and contextual diversity measures for words based on their surrounding lexical context.
Raw frequency
- Indices:
- Freq_HAL_AW
- Freq_HAL_CW
- Freq_HAL_FW
Logarithmic frequency
- Indices:
- Log_Freq_HAL_AW
- Log_Freq_HAL_CW
- Log_Freq_HAL_FW
Brown
The Brown Corpus is a pioneering corpus of American English texts published in 1961, offering a balanced range of genres for word frequency analysis.
Raw frequency
- Indices:
- Brown_Freq_AW
- Brown_Freq_CW
- Brown_Freq_FW
Logarithmic frequency
- Indices:
- Brown_Freq_AW_Log
- Brown_Freq_CW_Log
- Brown_Freq_FW_Log
TL
The Thorndike-Lorge (TL) corpus provides frequency counts based on popular magazine articles and other printed materials. It was designed to inform readability research and the development of educational materials.
Raw frequency
- Indices:
- TL_Freq_AW
- TL_Freq_CW
- TL_Freq_FW
Logarithmic frequency
- Indices:
- TL_Freq_AW_Log
- TL_Freq_CW_Log
- TL_Freq_FW_Log
SUBTLEXus
SUBTLEXus is a corpus built from American English film and television subtitles. It approximates word usage in spoken American English, which makes it suited to estimating conversational frequency.
Raw frequency
- Indices:
- SUBTLEXus_Freq_AW
- SUBTLEXus_Freq_CW
- SUBTLEXus_Freq_FW
Logarithmic frequency
- Indices:
- SUBTLEXus_Freq_AW_Log
- SUBTLEXus_Freq_CW_Log
- SUBTLEXus_Freq_FW_Log
TOEFL11
The TOEFL11 corpus is a learner corpus of essays written by English language learners, categorized by proficiency level and first-language (L1) background. Essays are grouped into High, Medium, and Low proficiency levels based on the average of two human-assigned scores (each on a 5-point scale): High is an average score of 4.0 or higher, Medium is an average score between 3.0 and 3.9, and Low is an average score below 3.0. WC stands for Whole Corpus, which includes essays from all proficiency levels.
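The proficiency banding described above can be sketched in Python as follows. The function name `proficiency_level` is hypothetical, and the labels mirror the High/Med/Low tokens used in the index names. The sketch assumes that each rater gives a whole-number score from 1 to 5, so averages move in half-point steps and no value falls between 3.9 and 4.0.

```python
def proficiency_level(score_a: float, score_b: float) -> str:
    """Map two human-assigned scores (5-point scale) to a proficiency band."""
    average = (score_a + score_b) / 2
    if average >= 4.0:
        return "High"
    if average >= 3.0:
        return "Med"
    return "Low"


for pair in [(4, 5), (3, 4), (2, 3)]:
    print(pair, proficiency_level(*pair))
```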
Raw frequency
- Indices:
- NNS_Raw_Freq_[High/Med/Low/WC]_AW
- NNS_Raw_Freq_[High/Med/Low/WC]_CW
- NNS_Raw_Freq_[High/Med/Low/WC]_FW
Logarithmic frequency
- Indices:
- NNS_Raw_Freq_[High/Med/Low/WC]_AW_log
- NNS_Raw_Freq_[High/Med/Low/WC]_CW_log
- NNS_Raw_Freq_[High/Med/Low/WC]_FW_log
Lemma frequency
- Indices:
- NNS_Lemma_Freq_[High/Med/Low/WC]_AW
- NNS_Lemma_Freq_[High/Med/Low/WC]_CW
- NNS_Lemma_Freq_[High/Med/Low/WC]_FW
Logarithmic lemma frequency
- Indices:
- NNS_Lemma_Freq_[High/Med/Low/WC]_AW_log
- NNS_Lemma_Freq_[High/Med/Low/WC]_CW_log
- NNS_Lemma_Freq_[High/Med/Low/WC]_FW_log
Lemma type frequency
- Indices:
- NNS_Lemma_Freq_Types_[High/Med/Low/WC]_AW
- NNS_Lemma_Freq_Types_[High/Med/Low/WC]_CW
- NNS_Lemma_Freq_Types_[High/Med/Low/WC]_FW
Logarithmic lemma type frequency
- Indices:
- NNS_Lemma_Freq_Types_[High/Med/Low/WC]_AW_log
- NNS_Lemma_Freq_Types_[High/Med/Low/WC]_CW_log
- NNS_Lemma_Freq_Types_[High/Med/Low/WC]_FW_log
References
- BNC Consortium. (2007). British National Corpus. Oxford Text Archive Core Collection.
- Blanchard, D., Tetreault, J., Higgins, D., Cahill, A., & Chodorow, M. (2013). TOEFL11: A corpus of non‐native English. ETS Research Report Series, 2013(2), i-15. https://doi.org/10.1002/j.2333-8504.2013.tb02331.x
- Brown, G. D. (1984). A frequency count of 190,000 words in the London-Lund Corpus of English Conversation. Behavior Research Methods, Instruments, & Computers, 16(6), 502-532. https://doi.org/10.3758/BF03200836
- Davies, M. (2009). The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14(2), 159-190. https://doi.org/10.1075/ijcl.14.2.02dav
- Kučera, H., Francis, W., Twaddell, W. F., Marckworth, M. L., Bell, L. M., & Carroll, J. B. (1967). Computational analysis of present-day American English. Brown University Press.
- Thorndike, E. L., & Lorge, I. (1944). The teacher's word book of 30,000 words. Bureau of Publications, Teachers College, Columbia University.